Abstract

We revisit various string indexing problems with range reporting features, namely, position- 
restricted substring searching, indexing substrings with gaps, and indexing substrings with in- 
tervals. We obtain the following main results. 

• We give efficient reductions for each of the above problems to a new problem, which we 
call substring range reporting. Hence, we unify the previous work by showing that we may 
restrict our attention to a single problem rather than studying each of the above problems 
individually. 

• We show how to solve substring range reporting with optimal query time and little space. 
Combined with our reductions this leads to significantly improved time-space trade-offs 
for the above problems. In particular, for each problem we obtain the first solutions with
optimal query time and O(n log^{O(1)} n) space, where n is the length of the indexed string.

• We show that our techniques for substring range reporting generalize to substring range 
counting and substring range emptiness variants. We also obtain non-trivial time-space 
trade-offs for these problems. 

Our bounds for substring range reporting are based on a novel combination of suffix trees and 
range reporting data structures. The reductions are simple and general and may apply to other 
combinations of string indexing with range reporting. 



1 Introduction 

Given a string S of length n the string indexing problem is to preprocess S into a compact rep- 
resentation that efficiently supports substring queries, that is, given another string P of length m 
report all occurrences of substrings in S that match P. Combining the classic suffix tree data
structure [14] with perfect hashing [13] leads to an optimal time-space trade-off for string indexing, i.e.,
an O(n) space representation that supports queries in O(m + occ) time, where occ is the number
of occurrences of P in S.

In recent years, several extensions of string indexing problems that add range reporting features 
have been proposed. For instance, Mäkinen and Navarro proposed the position-restricted substring
searching problem [21, 22]. Here, queries take an additional range [a, b] of positions in S and the
goal is to report the occurrences of P within S[a, b]. For such extensions of string indexing no
optimal time-space trade-off is known. For instance, for position-restricted substring searching one
can either get O(n log^ε n) space (for any constant ε > 0) and O(m + log log n + occ) query time or
O(n^{1+ε}) space with O(m + occ) query time [8, 21, 22]. Hence, removing the log log n term in the
query time comes at the cost of significantly increasing the space.

In this paper, we revisit a number of string indexing problems with range reporting features,
namely position-restricted substring searching, indexing substrings with gaps, and indexing sub- 
strings with intervals. We achieve the following results. 

• We give efficient reductions for each of the above problems to a new problem, which we 
call substring range reporting. Hence, we unify the previous work by showing that we may 
restrict our attention to a single problem rather than studying each of the above problems 
individually. 

• We show how to solve substring range reporting with optimal query time and little space. 
Combined with our reductions this leads to significantly improved time-space trade-offs for 
all of the above problems. For instance, we show how to solve position-restricted substring
searching in O(n log^ε n) space and O(m + occ) query time.

• We show that our techniques for substring range reporting generalize to substring range 
counting and substring range emptiness variants. We also obtain non-trivial time-space trade- 
offs for these problems. 

Our bounds for substring range reporting are based on a novel combination of suffix trees and 
range reporting data structures. The reductions are simple and general and may apply to other 
combinations of string indexing with range reporting. 



1.1 Substring Range Reporting 

Let S be a string where each position is associated with an integer value in the range [0, u]. The
integer associated with position i in S is the label of position i, denoted label(i), and we call S
a labeled string. Given a labeled string S, the substring range reporting problem is to compactly
represent S while supporting substring range reporting queries, that is, given a string P and a pair
of integers a and b, 0 ≤ a ≤ b ≤ u, report all starting positions in S that match P and whose labels
are in the range [a, b].
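
As a point of reference, the query semantics can be pinned down by a brute-force scan. The sketch below is only a specification aid, not the data structure developed in this paper; the function name `report_naive` and the sample string and labels are hypothetical.

```python
def report_naive(S, labels, P, a, b):
    """Return all 1-based starting positions i of P in S with a <= label(i) <= b.

    labels[i - 1] holds the label of position i. Runs in O(n * m) time, so it
    only serves to define what a substring range reporting query must return.
    """
    m = len(P)
    out = []
    for i in range(len(S) - m + 1):
        if S[i:i + m] == P and a <= labels[i] <= b:
            out.append(i + 1)  # positions are 1-based, as in the text
    return out


S = "abacabab"
labels = [5, 0, 3, 7, 1, 4, 2, 6]
print(report_naive(S, labels, "ab", 2, 5))
```

Here "ab" occurs at positions 1, 5 and 7 with labels 5, 1 and 2, so only positions 1 and 7 fall in the label range [2, 5].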

We assume a standard unit-cost RAM model with word size w and a standard instruction set 
including arithmetic operations, bitwise boolean operations, and shifts. We assume that a label 
can be stored in a constant number of words and therefore w = Ω(log u). The space complexity is
the number of words used by the algorithm. All bounds mentioned in this paper are valid in this 
model of computation. 

To solve substring range reporting a basic approach is to combine a suffix tree with a 2D range
reporting data structure. A query for a pattern P and range [a, b] consists of a search in the suffix
tree and then a 2D range reporting query with [a, b] and the lexicographic range of suffixes defined
by P. This is essentially the overall approach used in the known solutions for position-restricted
substring searching [8, 9, 21, 22, 31], which is a special case of substring range reporting (see the
next section).

Depending on the choice of the 2D range reporting data structure this approach leads to different
trade-offs. In particular, if we plug in the 2D range reporting data structure of Alstrup et al. [2],
we get a solution with O(n log^ε n) space and O(m + log log u + occ) query time (see Mäkinen
and Navarro [21, 22]). The log log u term in the query time is from the range reporting query.
Alternatively, if we use a fast data structure for the range successor problem [8, 31] to do the range
reporting, we get optimal O(m + occ) query time but increase the space to at least Ω(n^{1+ε}). Indeed,
since any 2D range reporting data structure with O(n log^{O(1)} n) space must use Ω(log log n) query
time [26], we cannot hope to avoid this blowup in space with this approach.

Our first main contribution is a new and simple technique that overcomes the inherent problem 
of the previous approach. We show the following result. 

Theorem 1 Let S be a labeled string of length n with labels in the range [0, u]. For any constants
ε, δ > 0, we can solve substring range reporting using O(n(log^ε n + log log u)) space, O(n(log n +
log^δ u)) expected preprocessing time, and O(m + occ) query time, for a pattern string of length m.

Compared to the previous results we achieve optimal query time with an additional O(n log log u)
term in the space. For the applications considered here, we have that u = O(n) and therefore the
space bound simplifies to O(n(log^ε n + log log u)) = O(n log^ε n). Hence, in this case there is no
asymptotic space overhead.

The key idea to obtain Theorem 1 is a new and simple combination of suffix trees with multiple
range reporting data structures for both 1 and 2 dimensions. Our solution handles queries differently 
depending on the length of the input pattern such that the overall query is optimized accordingly. 

Interestingly, the idea of using different query algorithms depending on the length of the pattern 
is closely related to the concept of filtering search introduced for the standard range reporting 
problem by Chazelle as early as 1986 [6]. Our new results show that this idea is also useful in
combinatorial pattern matching. 

Finally, we also consider substring range counting and substring range emptiness variants. Here, 
the goal is to count the number of occurrences in the range and to determine whether or not the 
range is empty, respectively. Similar to substring range reporting, these problems can also be 
solved in a straightforward way by combining a suffix tree with a 2D range counting or emptiness data
structure. We show how to extend our techniques to obtain improved time-space trade-offs for both 
of these problems. 

1.2 Applications 

Our second main contribution is to show that substring range reporting actually captures several 
other string indexing problems. In particular, we show how to reduce the following problems to 
substring range reporting. 

• Position-restricted substring searching: Given a string S of length n, construct a data
structure supporting the following query: Given a string P and query interval [a, b], with 1 ≤ a ≤
b ≤ n, return the positions of substrings in S matching P whose positions are in the interval
[a, b].

• Indexing substrings with intervals: Given a string S of length n, and a set of intervals π =
{[s_1, f_1], [s_2, f_2], ..., [s_|π|, f_|π|]} such that s_i, f_i ∈ [1, n] and s_i ≤ f_i, for all 1 ≤ i ≤ |π|, construct
a data structure supporting the following query: Given a string P and query interval [a, b],
with 1 ≤ a ≤ b ≤ n, return the positions of substrings in S matching P whose positions are
in [a, b] and in one of the intervals in π.

• Indexing substrings with gaps: Given a string S of length n and an integer d, the problem is
to construct a data structure supporting the following query: Given two strings P_1 and P_2
return all positions of substrings in S matching P_1 ∘ *^d ∘ P_2. Here ∘ denotes concatenation
and * is a wildcard matching all characters in the alphabet.



Previous results Let m be the length of P. Mäkinen and Navarro [21, 22] introduced the
position-restricted substring searching problem. Their fastest solution uses O(n log^ε n) space,
O(n log n) expected preprocessing time, and O(m + log log n + occ) query time. Crochemore et al. [8]
proposed another solution using O(n^{1+ε}) space, O(n^{1+ε}) preprocessing time, and O(m + occ) query
time (see also Section ). Using techniques from range non-overlapping indexing [7] it is possible
to improve these bounds for small alphabet sizes [27]. Several succinct versions of the problem have
also been proposed [4, 21, 22, 31]. All of these have significantly worse query time since they require
superconstant time per reported occurrence. Finally, Crochemore et al. [10] studied a restricted
version of the problem with a = 1 or b = n.

For the indexing substrings with intervals problem, Crochemore et al. [8, 9] gave a solution
with O(n log^2 n) space, O(|π| + n log^3 n) expected preprocessing time, and O(m + log log n + occ)
query time. They also showed how to achieve O(n^{1+ε}) space, O(n^{1+ε} + |π|) preprocessing time,
and O(m + occ) query time. Several papers [3, 17, 20] have studied the property matching problem,
which is similar to the indexing substrings with intervals problem, but where both start and end
point of the match must be in the same interval.

Iliopoulos and Rahman [18] studied the problem of indexing substrings with gaps. They gave a
solution using O(n log^ε n) space, O(n log n) expected preprocessing time, and O(m + log log n + occ)
query time, where m is the length of the two input strings. Crochemore and Tischler recently
proposed a variant of the problem [11].



Our results We reduce position-restricted substring searching, indexing substrings with intervals, 
and indexing substrings with gaps to substring range reporting. Applying Theorem 1 with our new
reductions, we get the following result. 

Theorem 2 Let S be a string of length n and let m be the length of the query. For any constant
ε > 0, we can solve

(i) Position-restricted substring searching using O(n log^ε n) space, O(n log n) expected
preprocessing time, and O(m + occ) query time.

(ii) Indexing substrings with intervals using O(n log^ε n) space, O(|π| + n log n) expected
preprocessing time, and O(m + occ) query time.

(iii) Indexing substrings with gaps using O(n log^ε n) space, O(n log n) expected preprocessing time,
and O(m + occ) query time (m is the size of the two input strings).

This improves the best known time-space trade-offs for all three problems, that all suffer from the 
trade-off inherent in 2D range reporting. 

The reductions are simple and general and may apply to other combinations of string indexing 
with range reporting. 

2 Basic Concepts



2.1 Strings and Suffix Trees 

Throughout the section we will let S be a labeled string of length |S| = n with labels in [0, u]. We
denote the character at position i by S[i] and the substring from position i to j by S[i, j]. The
substrings S[1, j] and S[i, n] are the prefixes and suffixes of S, respectively. The reverse of S is S^R.
We denote the label of position i by label_S(i). The order of suffix S[i, n], denoted order_S(i), is the
lexicographic order of S[i, n] among the suffixes of S.

The suffix tree for S, denoted T_S, is the compacted trie storing all suffixes of S [14]. The depth
of a node v in T_S is the number of edges on the path from v to the root. Each of the edges in T_S
is associated with some substring of S. The children of each node are sorted from left to right in
increasing alphabetic order of the first character of the substring associated with the edge leading
to them. The concatenation of substrings from the root to v is denoted str_S(v). The string depth
of v, denoted strdepth_S(v), is the length of str_S(v). The locus of a string P, denoted locus_S(P), is
the minimum depth node v such that P is a prefix of str_S(v). If P is not a prefix of a substring in
S we define locus_S(P) to be ⊥.

Each leaf ℓ in T_S uniquely corresponds to a suffix in S, namely, the suffix str_S(ℓ). Hence, we
will use label_S(ℓ) and order_S(ℓ) to refer to the label and order of the corresponding suffix. For an
internal node v we extend the notation such that

    label_S(v) = {label_S(ℓ) | ℓ is a descendant leaf of v}
    order_S(v) = {order_S(ℓ) | ℓ is a descendant leaf of v}.

Since children of a node are sorted, the left to right order of the leaves in T_S corresponds to the
lexicographic order of the suffixes of S. Hence, for any node v, order_S(v) is an interval. We denote
the left and right endpoints of this interval by l_v and r_v. When the underlying string S is clear
from the context we will often drop the subscript S for brevity.

The suffix tree for S uses O(n) space and can be constructed in O(sort(n)) time, where sort(n)
is the time for sorting n values in the model of computation [12]. We only need a standard
comparison-based O(n log n) suffix tree construction in our results. Let P be a string of length m.
If locus_S(P) = ⊥ then P does not occur as a substring in S. Otherwise, the substrings in S that
match P are the suffixes in order_S(locus_S(P)). Hence, we can compute all occurrences of P in S
by traversing the suffix tree from the root to locus_S(P) and then reporting all suffixes stored in the
subtree. Using perfect hashing [13] to represent the outgoing edges of each node in T_S we achieve
an O(n) space solution to string indexing that supports queries in O(m + occ) time (here occ is the total
number of occurrences of P in S).
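
The fact that the suffixes matching P form an interval [l_v, r_v] of orders can be illustrated without building a suffix tree. The following minimal sketch sorts the suffixes explicitly and finds the interval by binary search; it is far slower than the O(m) suffix tree search described above, and the names `build_suffix_order` and `locus_interval` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_order(S):
    """Return sa, where sa[k] is the 1-based start of the suffix with order k + 1."""
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])


def locus_interval(S, sa, P):
    """Return the order interval (l, r) of the suffixes that start with P, or None.

    Truncating lexicographically sorted suffixes to length |P| keeps them
    sorted, so the matching suffixes are contiguous and bisect applies.
    """
    m = len(P)
    prefixes = [S[i - 1:i - 1 + m] for i in sa]
    lo = bisect_left(prefixes, P)
    hi = bisect_right(prefixes, P)
    return (lo + 1, hi) if lo < hi else None


S = "banana"
sa = build_suffix_order(S)
print(sa, locus_interval(S, sa, "ana"))
```

For "banana" the suffixes with orders 2 and 3 start at positions 4 and 2, which are exactly the occurrences of "ana".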

2.2 Range Reporting 

Let X ⊆ {0, ..., u}^d be a set of points in a d-dimensional grid. The range reporting problem in
d dimensions is to compactly represent X while supporting range reporting queries, that is, given a
rectangle R = [a_1, b_1] × ··· × [a_d, b_d] report all points in the set R ∩ X. We use the following results
for range reporting in 1 and 2 dimensions.

Lemma 1 (Alstrup et al. [1], Mortensen et al. [24]) For a set of n points in [0, u] and any
constant γ > 0, we can solve 1D range reporting using O(n) space, O(n log^γ u) expected
preprocessing time and O(1 + occ) query time.

Lemma 2 (Alstrup et al. [2]) For a set of n points in [0, u] × [0, u] and any constant ε > 0, we
can solve 2D range reporting using O(n log^ε n) space, O(n log n) expected preprocessing time, and
O(log log u + occ) query time.
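
To make the 1D interface concrete, here is a minimal stand-in based on a sorted array and binary search. It answers queries in O(log n + occ) time rather than the O(1 + occ) of Lemma 1, so it illustrates only the query semantics; the class name `SortedRange1D` is hypothetical.

```python
from bisect import bisect_left, bisect_right


class SortedRange1D:
    """1D range reporting over a multiset of integer points (sorted-array stand-in)."""

    def __init__(self, points):
        self.points = sorted(points)

    def report(self, a, b):
        """Return all stored points p with a <= p <= b, in increasing order."""
        lo = bisect_left(self.points, a)
        hi = bisect_right(self.points, b)
        return self.points[lo:hi]


rs = SortedRange1D([7, 1, 4, 9, 4])
print(rs.report(4, 8))
```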



3 Substring Range Reporting 

We now show Theorem 1. Recall that S is a labeled string of length n with labels from [0, u].

3.1 The Data Structure

Our substring range reporting data structure consists of the following components. 

• The suffix tree T_S for S. For each node v in T_S we also store l_v and r_v. We partition T_S into
a top tree and a number of bottom trees. The top tree consists of all nodes in T_S whose string
depth is at most log log u and all their children. The trees induced by the remaining nodes
are the forest of bottom trees.

• A 2D range reporting data structure on the set of points {(order_S(i), label_S(i)) | i ∈ {1, ..., n}}.

• For each node v in the top tree, a 1D range reporting data structure on the set {label_S(i) |
i ∈ order_S(v)}.

We analyze the space and preprocessing time for the data structure. We use the range reporting
data structures from Lemmas 1 and 2. The space for the suffix tree is O(n) and the space for the
2D range reporting data structure is O(n log^ε n), for any constant ε > 0. We bound the space for
the (potentially O(n)) 1D range reporting data structures stored for the top tree. Let V_d be the
set of nodes in the top tree with depth d. Since the sets order(v), v ∈ V_d, partition the set of
descendant leaves of nodes in V_d, the total size of these sets is at most n. Hence, the total size
of the 1D range reporting data structures for the nodes in V_d is O(n). Since there are
at most log log u + 1 levels in the top tree, the space for all 1D range reporting data structures is
O(n log log u). Hence, the total space for the data structure is O(n(log^ε n + log log u)).

We can construct the suffix tree in O(sort(n)) time and the 2D range reporting data structure
in O(n log n) expected time. For any constant γ > 0, the expected preprocessing time for all 1D
range reporting data structures is

    Σ_{d=0}^{log log u} Σ_{v ∈ V_d} O(|order(v)| log^γ u) = O(n log log u · log^γ u) = O(n log^{2γ} u).

Setting δ = 2γ we use O(n(log n + log^δ u)) expected preprocessing time in total.

3.2 Substring Range Queries

Let P be a string of length m, and let a and b be a pair of integers, 0 ≤ a ≤ b ≤ u. To answer a
substring range query we want to compute the set of starting positions for P whose labels are in
[a, b]. First, we compute the node v = locus_S(P). If v = ⊥ then P is not a substring of S, and we
return the empty set. Otherwise, we compute the set of descendant leaves of v with labels in [a, b].
There are two cases to consider.

(i) If v is in the top tree we query the 1D range reporting data structure for v with the interval
[a, b].

(ii) If v is in a bottom tree, we query the 2D range reporting data structure with the rectangle [l_v, r_v] × [a, b].

Given the points returned by the range reporting data structures, we output the corresponding
starting positions of the corresponding suffixes. From the definition of the data structure it follows
that these are precisely the occurrences of P within the range [a, b]. Next consider the time
complexity. We find locus_S(P) in O(m) time (see Section 2). In case (i) we use O(1 + occ) time to
compute the result by Lemma 1. Hence, the total time for a substring range query for case (i) is
O(m + occ). In case (ii) we use O(log log u + occ) time to compute the result by Lemma 2. We have
that v = locus_S(P) is in a bottom tree and therefore m > strdepth(parent(locus_S(P))) > log log u.
Hence, the total time to answer a substring range query in case (ii) is O(m + log log u + occ) =
O(m + occ). Thus, in both cases we use O(m + occ) time.

Summing up, our solution uses O(n(log^ε n + log log u)) space, O(n(log n + log^δ u)) expected
preprocessing time, and O(m + occ) query time. This completes the proof of Theorem 1.

4 Applications 

In this section we show how to improve the results for the three problems position-restricted
substring searching, indexing substrings with intervals, and indexing gapped substrings, using our data
structure for substring range reporting. Let REPORT_S(P, a, b) denote a substring range reporting
query on string S with parameters P, a, and b.

4.1 Position-Restricted Substring Searching

We can reduce position-restricted substring searching to substring range reporting by setting
label(i) = i for all i = 1, ..., n. To answer a query we return the result of the substring range
reporting query REPORT_S(P, a, b). Since each label is equal to the position, it follows that the solution to
the substring range reporting instance immediately gives a solution to position-restricted substring
searching. Applying Theorem 1 with u = n, this proves Theorem 2(i).
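
The reduction amounts to a one-line labeling. In the sketch below a brute-force `report` stands in for the data structure of Theorem 1; both function names are hypothetical.

```python
def report(S, labels, P, a, b):
    """Brute-force substring range reporting (stand-in for the real structure)."""
    m = len(P)
    return [i + 1 for i in range(len(S) - m + 1)
            if S[i:i + m] == P and a <= labels[i] <= b]


def position_restricted(S, P, a, b):
    """Occurrences of P in S that start inside [a, b], via label(i) = i."""
    labels = list(range(1, len(S) + 1))  # label of position i is i itself
    return report(S, labels, P, a, b)


print(position_restricted("abababab", "ab", 2, 6))
```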

4.2 Indexing Substrings with Intervals 

We can reduce indexing substrings with intervals to substring range reporting by setting

    label(i) = i   if i ∈ φ for some φ ∈ π,
    label(i) = 0   otherwise.

To answer a query we return the result of the substring range reporting query REPORT_S(P, a, b).
Let I be the solution to the indexing substrings with intervals instance and let I' be the solution
to the substring range reporting instance derived by the above reduction. Then i ∈ I ⟺ i ∈ I'.

To prove this assume i ∈ I. Then i ∈ φ for some φ ∈ π and i ∈ [a, b]. From i ∈ φ and the
definition of label(i) it follows that label(i) = i. Thus, label(i) = i ∈ [a, b] and thus i ∈ I'. Assume
i ∈ I'. Then label(i) ∈ [a, b]. Since a > 0 also label(i) > 0, and it follows that label(i) = i. By the
reduction this means that i ∈ φ for some φ ∈ π. Since i = label(i), we have i ∈ [a, b] and therefore
i ∈ I.

Figure 1: A string S, the labeling for d = 2 (below the string), and the suffix tree T_{S^R}. Given
a query P_1 = ab and P_2 = bac we find v = locus_{S^R}(ba) (marked in the suffix tree). We have
l_v = 6 and r_v = 7 from the left-to-right order in T_{S^R}. The substring range reporting query
REPORT_S(P_2, 6, 7) returns 7. Hence, we report the occurrence at position 7 - 2 - 2 = 3.



We can construct the labeling in O(n + |π|) time if the intervals are sorted by startpoint or endpoint.
Otherwise additional time for sorting is needed. A similar approach is used in the solution by
Crochemore et al. [8].

Applying Theorem 1 with u = n, this proves Theorem 2(ii).
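
One simple way to compute the labeling for indexing substrings with intervals is a difference array followed by a single left-to-right pass. This is a sketch of one possible construction, not necessarily the one intended in the text; the function name `interval_labels` is hypothetical.

```python
def interval_labels(n, intervals):
    """Return labels[0..n-1], where labels[i - 1] = i if position i lies in
    some interval [s, f] (1-based, inclusive) and 0 otherwise."""
    diff = [0] * (n + 2)
    for s, f in intervals:
        diff[s] += 1       # an interval opens at s
        diff[f + 1] -= 1   # and closes just after f
    labels, covered = [], 0
    for i in range(1, n + 1):
        covered += diff[i]  # number of intervals containing position i
        labels.append(i if covered > 0 else 0)
    return labels


print(interval_labels(8, [(2, 3), (3, 5), (8, 8)]))
```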

4.3 Indexing Substrings with Gaps 

We can reduce the indexing substrings with gaps problem to substring range reporting as follows.
Construct the suffix tree of the reverse of S, i.e., the suffix tree T_{S^R} for S^R. For each node v in
T_{S^R} also store l_v and r_v. Set

    label_S(i) = order_{S^R}(n - i + d + 2)   for i ≥ d + 2,
    label_S(i) = 0                            otherwise.

To answer a query find the locus node v of P_1^R in T_{S^R}. Then use the substring range reporting
data structure to return all positions of substrings in S matching P_2 whose labels are in the range
[l_v, r_v]. For each position i returned by REPORT_S(P_2, l_v, r_v), return i - |P_1| - d. See Fig. 1 for an
example.
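
The index arithmetic of this reduction is easy to get wrong, so a small executable sketch may help. It assumes the fallback label is 0 and uses explicit sorting and scans in place of the suffix tree and the substring range reporting structure; the names `gap_labels` and `gapped_query` are hypothetical.

```python
def gap_labels(S, d):
    """Return (R, sa, labels): the reverse R of S, the 1-based starts of the
    suffixes of R in lexicographic order, and the labeling of the reduction."""
    n = len(S)
    R = S[::-1]
    sa = sorted(range(1, n + 1), key=lambda j: R[j - 1:])
    order_R = {j: k + 1 for k, j in enumerate(sa)}  # order of suffix R[j, n]
    labels = [order_R[n - i + d + 2] if i >= d + 2 else 0
              for i in range(1, n + 1)]
    return R, sa, labels


def gapped_query(S, d, P1, P2):
    """Return the 1-based positions where P1, d wildcards, then P2 occur in S."""
    n, m1, m2 = len(S), len(P1), len(P2)
    R, sa, labels = gap_labels(S, d)
    P1R = P1[::-1]
    # Orders of the suffixes of R that start with the reverse of P1.
    ranks = [k + 1 for k, j in enumerate(sa) if R[j - 1:j - 1 + m1] == P1R]
    if not ranks:
        return []
    l, r = ranks[0], ranks[-1]  # the interval [l_v, r_v]
    # Brute-force stand-in for REPORT_S(P2, l_v, r_v).
    hits = [i for i in range(1, n - m2 + 2)
            if S[i - 1:i - 1 + m2] == P2 and l <= labels[i - 1] <= r]
    return sorted(i - m1 - d for i in hits)


print(gapped_query("abcabxab", 1, "ab", "ab"))
```

In the sample string, "ab" followed by one arbitrary character and "ab" again occurs at positions 1 and 4.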



Correctness of the reduction We will now show that the reduction is correct. Let I be the
solution to the indexing substrings with gaps instance and let I' be the solution to the substring
range reporting instance derived by the above reduction. We will show i ∈ I ⟺ i ∈ I'. Let m_i = |P_i|
for i = 1, 2.

If i ∈ I then there is an occurrence of P_1 at position i in S and an occurrence of P_2 at
position i' = i + m_1 + d in S. It follows directly that there is an occurrence of P_1^R at position
i'' = n - (i + m_1) + 2 in S^R. By definition,

    label_S(i') = label_S(i + m_1 + d) = order_{S^R}(n - (i + m_1 + d) + d + 2) = order_{S^R}(i''),

where the second equality follows from the fact that i + m_1 + d ≥ d + 2. Since there is an
occurrence of P_1^R at position i'' in S^R, we have label_S(i') = order_{S^R}(i'') ∈ order_{S^R}(locus_{S^R}(P_1^R)).
Thus, label_S(i') ∈ [l_v, r_v], and since there is an occurrence of P_2 at position i' in S, we have
i' - m_1 - d = i ∈ I'.

If i ∈ I' then there is an occurrence of P_2 at position i' = i + m_1 + d with label(i') in the range
[l_v, r_v], where v = locus_{S^R}(P_1^R). We need to show that this implies that there is an occurrence of
P_1 at position i in S. By definition,

    label_S(i') = order_{S^R}(n - i' + d + 2) = order_{S^R}(n - i - m_1 + 2).

Let i'' = n - i - m_1 + 2. Since order_{S^R}(i'') = label_S(i') ∈ [l_v, r_v], there is an occurrence of P_1^R at
position i'' in S^R. It follows directly that there is an occurrence of P_1 at position n - i'' - m_1 + 2 =
n - (n - i - m_1 + 2) - m_1 + 2 = i in S. Therefore, i ∈ I.

Complexity Construction of the suffix tree T_{S^R} takes time O(n log n) and the labeling can be
constructed in time O(n). Both use space O(n). It takes O(m_1) time to find the locus node of
P_1^R in T_{S^R}. The substring range reporting query takes time O(m_2 + occ). Thus the total query
time is O(m + occ).

Applying Theorem 1 with u = n, this completes the proof of Theorem 2(iii).

5 Substring Range Counting and Emptiness 

We now show how to apply our techniques to substring range counting and substring range empti- 
ness. Analogous to substring range reporting, the goal is here to count the number of occurrences 
in the range and to determine whether or not the range is empty, respectively. A straightforward 
way to solve these problems is to combine a suffix tree with a 2D range counting data structure 
and a 2D range emptiness data structure, respectively. Using the techniques from Section 3 we
show how to significantly improve the bounds of this approach in both cases. We note that by
the reductions in Section 4 the bounds for substring range counting and substring range emptiness
also immediately imply results for counting and emptiness versions of position-restricted substring 
searching, indexing substrings with intervals, and indexing substrings with gaps. 

5.1 Preliminaries 

Let X ⊆ {0, ..., u}^d be a set of points in a d-dimensional grid. Given a query rectangle R =
[a_1, b_1] × ··· × [a_d, b_d], a range counting query computes |R ∩ X|, and a range emptiness query
computes if R ∩ X = ∅. Given X the range counting problem and the range emptiness problem are
to compactly represent X, while supporting range counting queries and range emptiness queries,
respectively. Note that any solution for range reporting or range counting implies a solution for
range emptiness with the same complexity (ignoring the occ term for range reporting queries). We
will need the following additional geometric data structures.

Lemma 3 (JaJa et al. [19]) For a set of n points in [0, u] × [0, u] we can solve 2D range counting
in O(n) space, O(n log n) preprocessing time, and O(log n / log log n + log log u) query time.

Lemma 4 (van Emde Boas et al. [29, 30], Mehlhorn and Näher [23]) For a set of n points
in [0, u] we can solve 1D range counting in O(n) space, O(n log log n) preprocessing time, and
O(log log u) query time.

To achieve the result of Lemma 4 we use a van Emde Boas data structure [29, 30] implemented
in linear space [23] using perfect hashing. This data structure supports predecessor queries in
O(log log u) time. By also storing for each point its rank in the sorted order of the points, we can
compute a range counting query by two predecessor queries. To build the data structure efficiently
we need to sort the points and build suitable perfect hash tables. We can sort deterministically in
O(n log log n) time [16], and we can build the needed hash tables in O(n) time using deterministic
hashing [15] combined with a standard two-level approach (see e.g., Thorup [28]).
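
The idea of counting with two predecessor queries and stored ranks can be sketched as follows. Binary search on a sorted array stands in for the van Emde Boas predecessor search, so the query time here is O(log n) rather than O(log log u); the class name `RankCount1D` is hypothetical.

```python
from bisect import bisect_right


class RankCount1D:
    """1D range counting from ranks of predecessors (sorted-array stand-in)."""

    def __init__(self, points):
        self.points = sorted(points)

    def rank_of_predecessor(self, x):
        """Rank of the largest stored point <= x, i.e. the number of points <= x."""
        return bisect_right(self.points, x)

    def count(self, a, b):
        """Number of stored points p with a <= p <= b (integer coordinates)."""
        return self.rank_of_predecessor(b) - self.rank_of_predecessor(a - 1)


rc = RankCount1D([7, 1, 4, 9, 4])
print(rc.count(4, 8))
```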

Lemma 5 (Chan et al. [5]) For a set of n points in [0, u] × [0, u] we can solve 2D range emptiness
in O(n log log n) space, O(n log n) preprocessing time, and O(log log u) query time.



5.2 The Data Structures 

We now show how to efficiently solve substring range counting and substring range emptiness. 
Recall that S is a labeled string of length n with labels from [0, u]. 

We can directly solve substring range counting by combining a suffix tree with the 2D range
counting result from Lemma 3. This leads to a solution using O(n) space and O(m + log n / log log n +
log log u) query time. We show how to improve the query time to O(m + log log u) at the cost of
increasing the space to O(n log n / log log n). Hence, we remove the log n / log log n term from the
query time at the cost of increasing the space by a log n / log log n factor. We cannot hope to achieve
such a bound using a suffix tree combined with a 2D range counting data structure since any 2D
range counting data structure using O(n log^{O(1)} n) space requires Ω(log n / log log n) query time [25].
We can also directly solve substring range emptiness by combining a suffix tree with the 2D range
emptiness result from Lemma 5. This solution uses O(n log log n) space and O(m + log log n) query
time. We show how to achieve optimal O(m) query time with space O(n log log u).

Our data structure for substring range counting and emptiness follows the construction in
Section 3. We partition the suffix tree into a top and a number of bottom trees and store a 1D data
structure for each node in the top tree and a single 2D data structure. To answer a query for a
pattern string P of length m, we search the suffix tree with P and use the 1D data structure if the
search ends in the top tree and otherwise use the 2D data structure.

We describe the specific details for each problem. First we consider substring range counting.
In this case the top tree consists of all nodes of string depth at most log n / log log n. The 1D and 2D
data structures used are the ones from Lemmas 4 and 3. By the same arguments as in Section 3 the
total space used for the 1D data structures for all nodes in the top tree at depth d is at most O(n)
and hence the total space for all 1D data structures is O(n log n / log log n). Since the 2D data
structure uses O(n) space, the total space is O(n log n / log log n). The time to build all 1D data
structures is O(n (log n / log log n) · log log n) = O(n log n). Since the suffix tree and the 2D data
structure can be built within the same bound, the total preprocessing time is O(n log n). Given
a pattern of length m, a query uses O(m + log log u) time if the search ends in the top tree, and
O(m + log n / log log n + log log u) time if the search ends in a bottom tree. Since bottom trees
consist of nodes of string depth more than log n / log log n the time to answer a query in both cases
is O(m + log log u). In summary, we have the following result.

Theorem 3 Let S be a labeled string of length n with labels in the range [0, u]. We can solve
substring range counting using O(n log n / log log n) space, O(n log n) preprocessing time, and O(m +
log log u) query time, for a pattern string of length m.

Next we consider substring range emptiness. In this case the top tree consists of all nodes
of string depth at most log log u. We use the 1D and 2D data structures from Lemma 1 and
Lemma 5. The total space for all 1D data structures is O(n log log u). Since the 2D data structure
uses O(n log log n) space the total space is O(n log log u). For any constant γ > 0, the expected time
to build all 1D data structures is O(n log log u log^γ u) = O(n log^δ u) for suitable constant δ > 0. The
suffix tree and the 2D data structure can be built in O(n log n) time and hence the total expected
preprocessing time is O(n(log n + log^δ u)). If the search for a pattern string ends in the top tree
the query time is O(m) and if the search ends in a bottom tree the query time is O(m + log log u).
As above, the partition in top and bottom trees ensures that the query time in both cases is O(m).
In summary, we have the following result.

Theorem 4 Let S be a labeled string of length n with labels in the range [0, u]. For any constant
δ > 0 we can solve substring range emptiness using O(n log log u) space, O(n(log n + log^δ u)) expected
preprocessing time, and O(m) query time, for a pattern string of length m.